Introduction

In this part of this study, for initial modeling and analysis, I will be looking at the total number of thefts from January 2001 - March 11, 2023 that were reported. Note that due to the large size of the original dataset (nearly 8 million rows), the raw data is not included in this repository. The raw data can be accessed here.

Data Overview

For this partition of the data, there are two variables: year/month and the number of thefts reported in each month. The full dataset has more variables, which are described below. Each row in the full dataset represents an individual crime that was reported.

Variables

ID Unique identifier for the record.
Case number Chicago id for the case number
Date Date when the incident occurred, this is sometimes a best estimate.
Block The partially redacted address where the incident occurred.
IUCR The Illinois Uniform Crime Reporting Code.
Primary Type The primary description of the IUCR code.
Description The secondary description of the IUCR code.
Location description The primary description of the location where the incident occurred.
Arrest Whether or not the incident resulted in an arrest.
Domestic Whether or not the incident was a domestic incident.
Beat Indicates the beat where the incident occurred.
District The police district where the incident occurred.
Ward The city council district where the incident occurred.
Community Area The community area where the incident occurred.
FBI Code FBI Code crime classification.
X Coordinate The X coordinate location where the incident occurred.
Y coordinate The Y coordinate where the incident occurred.
Year Year the incident occurred.
Updated on Date and time the record was last updated.
Latitude The latitude where the incident occurred.
Longitude The longitude where the incident occurred.
Location The location of the incident.

Population Data Overview

The population data for this project was extracted using the US Census API ACS, which provided population data from 2005-2019 from Cook County. Since Cook County is the base of the Chicago metropolitan area, I chose to extract this data, and include a variable that scaled the number of thefts over time by the yearly population. So far, I haven’t been able to track down monthly population data for the Chicago area. Note that also the data from 2001-2005 is harder to access as it is only available through the population estimates API. Therefore, although the original dataset goes from 2001-2023, at this time, I am focusing on 2005-2019. The population from the Chicago website is limited to 2018-2022.

thefts <- read_csv("data/thefts_per_capita.csv") %>%
  select(-year)
## Rows: 180 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (4): NumThefts, year, population, TheftsPer1000
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chicago_population <- read_csv("data/cook_county_estimates_2005-2019.csv") %>%
  mutate(year = year + 2004) %>%
  pivot_wider(names_from = variable, values_from = estimate)
## Rows: 30 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): NAME, variable
## dbl (3): year, GEOID, estimate
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(thefts)
## # A tibble: 6 × 4
##   Month    NumThefts population TheftsPer1000
##   <chr>        <dbl>      <dbl>         <dbl>
## 1 2005 Jan      6114    5206357          1.17
## 2 2005 Feb      5592    5206357          1.07
## 3 2005 Mar      6570    5206357          1.26
## 4 2005 Apr      7029    5206357          1.35
## 5 2005 May      7276    5206357          1.40
## 6 2005 Jun      7408    5206357          1.42
head(chicago_population)
## # A tibble: 6 × 5
##    year GEOID NAME                  median_income total_pop
##   <dbl> <dbl> <chr>                         <dbl>     <dbl>
## 1  2005 17031 Cook County, Illinois         48950   5206357
## 2  2006 17031 Cook County, Illinois         50691   5288655
## 3  2007 17031 Cook County, Illinois         52564   5285107
## 4  2008 17031 Cook County, Illinois         54582   5294664
## 5  2009 17031 Cook County, Illinois         52539   5287037
## 6  2010 17031 Cook County, Illinois         51466   5200950
thefts_ts <- thefts %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month)
chicago_population_ts <- chicago_population %>%
  as_tsibble(key = c(GEOID, NAME), index = year)

Data Exploration

Time Series Plots of Variables

Raw Number of Thefts

chicago_crime <- read.csv("data/thefts_by_month.csv")
chicago_crime <- chicago_crime %>%
  select(-X) %>%
  rename(NumThefts = sum.Count.) %>%
  drop_na(month)
head(chicago_crime)
##   year month NumThefts
## 1 2022    10      5224
## 2 2015     2      3228
## 3 2019    10      5390
## 4 2001     1      7867
## 5 2017     3      4493
## 6 2008     8      8501
chicago_crime_monthly <- chicago_crime %>%
 mutate(month = month.name[month]) %>%
  mutate(Month = str_c(year, month, sep = " ")) %>%
  select(Month, NumThefts) %>%
  mutate(Month = yearmonth(Month)) %>%
  filter(year(Month) < 2023) %>%
  as_tsibble(index = Month)
head(chicago_crime_monthly)
## # A tsibble: 6 x 2 [1M]
##      Month NumThefts
##      <mth>     <int>
## 1 2001 Jan      7867
## 2 2001 Feb      6669
## 3 2001 Mar      7765
## 4 2001 Apr      7686
## 5 2001 May      8420
## 6 2001 Jun      8612
chicago_crime_monthly %>%
  autoplot(NumThefts)

chicago_crime_monthly %>%
  ACF(NumThefts) %>%
  autoplot()

chicago_crime_monthly %>%
  gg_lag(NumThefts, lags = 1:12)

chicago_crime_monthly %>%
  gg_season(NumThefts)

chicago_crime_monthly %>%
  gg_subseries(NumThefts)

Population Of Cook County

Note: I am working on getting the population of Chicago instead, but this is more complicated to mine from and will be a stretch goal.

chicago_population_ts %>%
  autoplot(total_pop)

Thefts in Chicago Per 1000 Residents in Cook County

thefts_ts %>%
  autoplot(TheftsPer1000)

Models

Classical Decomposition of the Number of Thefts Per 1000 Residents of Cook County

Multiplicative Decomposition

thefts_ts %>%
  gg_lag(TheftsPer1000, lags = 1:12)

thefts_ts %>%
  gg_season(TheftsPer1000)

Initially, I will test the four simple models that we’ve covered in class: naive, seasonal naive, mean, and random walk with drift. There appears to be a stronger seasonal component than trend component, so my guess is that seasonal naive will probably perform the best, but I will test all 4 models with a test set and predicted set.

Raw Count of Thefts

Forecasting Number of Thefts from 2017-2019

Model

thefts_train <- thefts_ts %>%
  filter(year(Month) <= 2017)
raw_count.model <- thefts_train %>%
  model(
    naive = NAIVE(NumThefts),
    snaive = SNAIVE(NumThefts),
    mean = MEAN(NumThefts),
    lm = RW(NumThefts ~ drift()),
    
  ) 
raw_count.model.forecast <- raw_count.model %>%
  forecast(h = "2 years")
raw_count.model.forecast %>%
  autoplot(filter(thefts_ts, year(Month) >= 2015))

Plot

raw_count.model.forecast %>%
  autoplot(thefts_ts)

Tables

raw_count.model.forecast %>%
  accuracy(thefts_ts)
## # A tibble: 4 × 10
##   .model .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm     Test   -15.2  657.  558.  -1.90 10.9  1.46  1.37  0.656
## 2 mean   Test  -896.  1108.  922. -18.7  19.1  2.42  2.32  0.654
## 3 naive  Test   -73.1  656.  540.  -2.99 10.7  1.41  1.37  0.654
## 4 snaive Test   -41.5  247.  201.  -1.05  3.92 0.526 0.517 0.131

Forecasting Number of Thefts Per Capita

percapita.model <- thefts_train %>%
  model(
    naive = NAIVE(TheftsPer1000),
    snaive = SNAIVE(TheftsPer1000),
    mean = MEAN(TheftsPer1000),
    lm = RW(TheftsPer1000 ~ drift()),
    
  ) 
percapita.model.forecast <- percapita.model %>%
  forecast(h = "2 years")
percapita.model.forecast %>%
  autoplot(thefts_ts)

Plot

raw_count.model.forecast %>%
  accuracy(thefts_ts)
## # A tibble: 4 × 10
##   .model .type     ME  RMSE   MAE    MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm     Test   -15.2  657.  558.  -1.90 10.9  1.46  1.37  0.656
## 2 mean   Test  -896.  1108.  922. -18.7  19.1  2.42  2.32  0.654
## 3 naive  Test   -73.1  656.  540.  -2.99 10.7  1.41  1.37  0.654
## 4 snaive Test   -41.5  247.  201.  -1.05  3.92 0.526 0.517 0.131

Tables

percapita.model.forecast %>%
  accuracy(thefts_ts)
## # A tibble: 4 × 10
##   .model .type       ME   RMSE    MAE     MPE  MAPE  MASE RMSSE   ACF1
##   <chr>  <chr>    <dbl>  <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 lm     Test   0.00620 0.127  0.109   -0.992 11.0  1.50  1.40  0.656 
## 2 mean   Test  -0.155   0.200  0.163  -16.9   17.5  2.24  2.21  0.654 
## 3 naive  Test  -0.00500 0.126  0.106   -2.08  10.7  1.45  1.39  0.654 
## 4 snaive Test   0.00106 0.0458 0.0371  -0.157  3.73 0.509 0.506 0.0919

Residuals of the Best Method

As I expected, the best performing model appears to be the seasonal naive model, across all measures. Now, I will look at the residuals to see whether they look like white noise.

raw_count.snaive <- thefts_train %>%
  model(
    snaive = SNAIVE(TheftsPer1000)
  )
raw_count.snaive %>%
  gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

There is still significant autocorrelation in these residuals, and the distribution looks slightly left skewed. Therefore, the seasonal naive method is not appropriate for predicting the number of thefts over time.

Exploring the Unemployment Variable

unemployment <- read_csv("data/unemployment_long.csv")
## Rows: 288 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Month
## dbl (1): value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
unemployment_ts <- unemployment %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month)
unemployment_ts %>%
  autoplot(value)
## Warning: Removed 10 rows containing missing values (`geom_line()`).

Classical Decomposition

Additive

unemployment_ts %>%
  gg_lag(value, lags = 1:12)
## Warning: Removed 10 rows containing missing values (gg_lag).

unemployment_ts %>%
  gg_season()
## Plot variable not specified, automatically selected `y = value`
## Warning: Removed 10 rows containing missing values (`geom_line()`).

unemployment_ts %>%
  gg_subseries(value)
## Warning: Removed 1 row containing missing values (`geom_line()`).

fc <- unemployment_ts %>%
  stretch_tsibble(.init = 3, .step = 1) %>%
  filter(.id < max(.id)) %>%
  model(
    snaive = SNAIVE(value),
    lm = RW(value ~ drift()),
    mean = MEAN(value),
    naive = NAIVE(value),
  ) %>% forecast(h = 3) |>
  group_by(.id) |>
  mutate(h = row_number()) |>
  ungroup() |>
  as_fable(response = "value", distribution = value)
unemployment_ts |>
  model(RW(value ~ drift())) |>
  gg_tsresiduals()
## Warning: Removed 11 rows containing missing values (`geom_line()`).
## Warning: Removed 11 rows containing missing values (`geom_point()`).
## Warning: Removed 11 rows containing non-finite values (`stat_bin()`).

chicago_population <- read_csv("data/Chicago_Population_Counts.csv") %>%
  filter(`Geography Type` == "Citywide")
## Rows: 235 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Geography Type, Geography, Record ID
## dbl (23): Year, Population - Total, Population - Age 0-17, Population - Age ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chicago_pop_ts <- chicago_population %>%
  as_tsibble(index = Year, key = c(`Geography Type`, `Geography`, `Record ID`))
# Data from the Chicago.gov website related to population
chicago_pop_ts %>%
  ggplot(aes(x = Year, y = `Population - Total`)) + geom_line() + geom_point()

chicago_crime_with_chicago_pop <- chicago_crime_monthly %>%
  mutate(Year = year(Month))
chicago_crime_with_chicago_pop <- chicago_crime_with_chicago_pop %>%
  filter(Year >= 2018) %>%
  inner_join(chicago_pop_ts, by = "Year") %>%
    mutate(TheftsPerCapitaChicago = NumThefts / `Population - Total` * 1000)
unique(chicago_crime_with_chicago_pop$`Geography Type`)
## [1] "Citywide"
unique(chicago_crime_with_chicago_pop$Geography)
## [1] "Chicago"
chicago_crime_with_chicago_pop %>%
  autoplot(TheftsPerCapitaChicago)

chicago_crime_with_chicago_pop %>%
  autoplot(NumThefts)